Abstract

We introduce a new nearest-prototype classifier, the prototype vector
machine (PVM). It arises from a combinatorial optimization problem
which we cast as a variant of the set cover problem. We propose
two algorithms for approximating its solution. The PVM selects a
relatively small number of representative points which can then be used
for classification. It contains 1-NN as a special case. The method is
compatible with any dissimilarity measure, making it amenable to
situations in which the data are not embedded in an underlying feature
space or in which using a non-Euclidean metric is desirable. Indeed,
we demonstrate on the much-studied ZIP code data how the PVM can
reap the benefits of a problem-specific metric. In this example, the
PVM outperforms the highly successful 1-NN with tangent distance,
and does so retaining fewer than half of the data points. This example
highlights the strengths of the PVM in yielding a low-error, highly
interpretable model. Additionally, we apply the PVM to a protein
classification problem in which a kernel-based distance is used.

1 Introduction 

Suppose we are given a set of training points $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^p$ with
corresponding class labels $y_1, \ldots, y_n \in \{1, \ldots, L\}$ and, in addition, a set of
unlabeled points $Z = \{z_1, \ldots, z_m\} \subset \mathbb{R}^p$. Our goal is to choose a relatively
small set of prototypes $\mathcal{P}_l \subseteq Z$ for each class $l$ in such a way that the
collection $\mathcal{P}_1, \ldots, \mathcal{P}_L$ represents a summary or distillation of the training set
(i.e., someone given only $\mathcal{P}_1, \ldots, \mathcal{P}_L$ would have a good sense of the original
training data, $X$ and $y$). While our default choice is $Z = X$, we find it
notationally easier to differentiate between the two sets. When $Z = X$, we
are in the standard setting of a condensation problem [Ripley 2005].



Having a well-selected set of prototypes $\mathcal{P}_1, \ldots, \mathcal{P}_L \subseteq Z$ is advantageous
for two main reasons: interpretability and classification. For domain
specialists, examining a handful of representative examples of each class can
be highly informative, especially when $n$ is large (since looking through all
examples from the original data set could be overwhelming or even
infeasible). Intuitively, a well-chosen set $\mathcal{P}_l \subseteq Z$ of prototypes for class $l$ should
capture the full spread of variation within this class while also taking into
account how class $l$ differs from other classes. Finally, the relative number
of prototypes in each class should be determined by the complexity of that
class.

The other major use of the prototypes is for classification. Once we have
prototype sets $\mathcal{P}_1, \ldots, \mathcal{P}_L$, we may classify any new $x \in \mathbb{R}^p$ according to the
class whose $\mathcal{P}_l$ contains the nearest prototype:

$$c(x) = \arg\min_{l} \; \min_{z \in \mathcal{P}_l} d(x, z). \qquad (1)$$

Notice that this classification rule reduces to 1-nearest-neighbors (1-NN) in
the case that $\mathcal{P}_l$ consists of all $x_i \in X$ with $y_i = l$.
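As an illustration of rule (1), the following is a minimal sketch of nearest-prototype classification. The function names, the Euclidean default for d, and the toy prototype sets are assumptions made for this example only; any dissimilarity could be passed in place of `euclidean`.

```python
import math

def euclidean(a, b):
    """Euclidean distance; any other dissimilarity could be used instead."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def classify(x, prototypes, d=euclidean):
    """Return the class whose prototype set contains the prototype nearest to x.

    `prototypes` maps each class label l to its (non-empty) list of prototypes.
    """
    best_label, best_dist = None, math.inf
    for label, protos in prototypes.items():
        for z in protos:
            dist = d(x, z)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label

# Hypothetical toy prototype sets for two classes in the plane.
toy_prototypes = {1: [(0.0, 0.0), (1.0, 0.0)], 2: [(5.0, 5.0)]}
label_near_origin = classify((0.2, -0.1), toy_prototypes)
label_far = classify((4.0, 4.5), toy_prototypes)
```

If every training point of a class is placed in that class's prototype list, the same function performs 1-NN classification.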

In this paper, we introduce the prototype vector machine (PVM), which
describes a particular choice for the sets $\mathcal{P}_1, \ldots, \mathcal{P}_L$. At its heart is the
premise that $\mathcal{P}_l$ should consist of points that are close to many training
points of class $l$ and are far from training points of other classes. This
intuition captures the sense in which the word "prototypical" is commonly 
used. 

In Section 2 we begin with a conceptually simple optimization criterion
that describes a desirable choice for $\mathcal{P}_1, \ldots, \mathcal{P}_L$. We express this idea as
an integer program and then in Section 3 present two approximation
algorithms for it. Section 4 discusses considerations for applying the PVM most
effectively to a given data set. In Section 5 we give an overview of related
work. In Section 6 we demonstrate the PVM's effectiveness, both in terms
of classification accuracy and ease of interpretation, on a number of real
data sets, including the much-studied ZIP code digits data set.

Finally, a note on the name: The PVM has a number of similarities 
with the Support Vector Machine: sparsity in the samples and the slack
formulation. The PVM integer program is an extension of the set cover
problem, which we review presently.



1.1 The set cover integer program 

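Consider the two sets $X$ and $Z$ but without the labels $y$. Let $D$ be the
$n \times m$ matrix of dissimilarities, with $D_{ij} = d(x_i, z_j)$ for each $x_i \in X$ and
$z_j \in Z$ (note: $d$ need not be a metric), and fix $\epsilon > 0$. The goal is to find
the smallest subset of points $\mathcal{P} \subseteq Z$ such that every point $x_i \in X$ is within
$\epsilon$ of some point in $\mathcal{P}$ (i.e., there exists $z_j \in \mathcal{P}$ with $d(x_i, z_j) < \epsilon$). Let
$B_\epsilon(x) = \{x' \in \mathbb{R}^p : d(x', x) < \epsilon\}$ denote the ball of radius $\epsilon$ centered at $x$.
Introducing the indicator variables

$$\alpha_j = \begin{cases} 1 & \text{if } z_j \in \mathcal{P} \\ 0 & \text{otherwise,} \end{cases}$$

this problem can be stated as an integer program:

$$\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} \alpha_j \\
\text{subject to} \quad & \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j \ge 1 \quad \forall\, x_i \in X \qquad (2) \\
& \alpha_j \in \{0, 1\} \quad \forall\, z_j \in Z.
\end{aligned}$$

The objective is simply $|\mathcal{P}|$. The summation in the constraint counts the
number of elements of $\mathcal{P}$ that are within $\epsilon$ of the point $x_i$, so a feasible
solution to the above integer program is one that has at least one prototype
within $\epsilon$ of each training point.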
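To make the covering constraint concrete, here is a small sketch that builds the covering sets from a dissimilarity matrix and checks whether a given 0/1 vector is feasible for the set cover program. The helper names `cover_sets` and `is_feasible` and the four-point example are hypothetical, and the strict inequality follows the ball definition above.

```python
def cover_sets(D, eps):
    """cover[j] = set of indices i with D[i][j] < eps, i.e. x_i lies in the ball of z_j."""
    n, m = len(D), len(D[0])
    return [{i for i in range(n) if D[i][j] < eps} for j in range(m)]

def is_feasible(alpha, D, eps):
    """Check the covering constraint for a 0/1 vector alpha of length m."""
    cover = cover_sets(D, eps)
    covered = set()
    for j, a in enumerate(alpha):
        if a == 1:
            covered |= cover[j]
    return len(covered) == len(D)

# Hypothetical example: four points on a line, with Z = X.
points = [0.0, 1.0, 2.0, 10.0]
D = [[abs(x - z) for z in points] for x in points]

# The second point covers the first three; the last point covers only itself.
feasible_two = is_feasible([0, 1, 0, 1], D, eps=1.5)
# Dropping the last prototype leaves the point at 10.0 uncovered.
feasible_one = is_feasible([0, 1, 0, 0], D, eps=1.5)
objective_two = sum([0, 1, 0, 1])  # number of prototypes used
```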

From a machine learning point of view, set cover can be seen as a
clustering problem in which we wish to find the smallest number of clusters such
that every point is within $\epsilon$ of at least one cluster center. In the language
of vector quantization, it seeks the smallest codebook (restricted to $Z$) such
that no vector is distorted by more than $\epsilon$ [Tipping and Schölkopf 2001].



2 The prototype vector machine 

The prototype vector machine is an extension of the set cover problem to
the supervised learning context (in which each $x_i \in X$ has a class label $y_i$).
The PVM seeks a set of prototypes for each class that is optimal in a sense
that will be made precise in what follows. For a given choice of $\mathcal{P}_l \subseteq Z$,
we consider the set of $\epsilon$-balls centered at each $z_j \in \mathcal{P}_l$ (see Figure 1). A
desirable prototype set for class $l$ is one that induces a set of balls which

(a) covers as many training points of class $l$ as possible,

(b) covers as few training points as possible of classes other than $l$, and

(c) is sparse (i.e., uses as few prototypes as possible for the given $\epsilon$).

Figure 1: Given a value for $\epsilon$, the choice of $\mathcal{P}_1, \ldots, \mathcal{P}_L$ induces $L$ partial
covers of the training points by $\epsilon$-balls centered at each prototype. Here $\epsilon$
is varied from the smallest interpoint distance (upper-left) to approximately
the median interpoint distance (lower-right).



2.1 PVM as an integer program 



We now express the three properties above as an integer program, taking as
a starting point the set cover problem of Equation (2). Property (b) suggests
that in certain cases it may be necessary to leave some points of class $l$
uncovered. For this reason, we adopt a prize-collecting set cover framework
for our problem (i.e., we assign a cost to each covering set, a penalty for
being uncovered to each point, and then find the minimum-cost partial cover
[Könemann et al. 2006]). Let $\alpha_j^{(l)} \in \{0, 1\}$ indicate whether we choose $z_j$ to
be in $\mathcal{P}_l$ (i.e., to be a prototype for class $l$). We define the PVM to be a
solution to the following integer program:



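$$\begin{aligned}
\underset{\alpha_j^{(l)},\, \xi_i,\, \eta_i}{\text{minimize}} \quad & \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \eta_i + \lambda \sum_{j,l} \alpha_j^{(l)} \\
\text{subject to} \quad & \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{(y_i)} \ge 1 - \xi_i \quad \forall\, x_i \in X \qquad (3a) \\
& \sum_{l \ne y_i} \; \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{(l)} \le 0 + \eta_i \quad \forall\, x_i \in X \qquad (3b) \\
& \alpha_j^{(l)} \in \{0, 1\} \quad \forall\, z_j \in Z,\; l \in \{1, \ldots, L\} \\
& \xi_i, \eta_i \ge 0 \quad \forall\, x_i \in X.
\end{aligned}$$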

We have introduced two slack variables, $\xi_i$ and $\eta_i$, per training point $x_i$.
Constraint (3a) enforces that each training point be covered by at least one
ball of its own class-type (otherwise $\xi_i = 1$). Constraint (3b) expresses the
condition that training point $x_i$ not be covered with balls of other classes
(otherwise $\eta_i > 0$). In particular, the slack variables can be interpreted as

• $\xi_i$ = 1 if $x_i$ is not covered by a class-$y_i$ prototype ball, and 0 otherwise.

• $\eta_i$ = Number of prototypes covering $x_i$ that are not of class $y_i$.

Finally, $\lambda \ge 0$ is a parameter specifying the cost of adding a prototype.
Its effect is to control the number of prototypes chosen (corresponding to
property (c) of the last section). We generally choose $\lambda = 1/n$, so that
property (c) serves only as a "tie-breaker" for choosing among multiple
solutions that do equally well on properties (a) and (b). Hence, in words,
we are minimizing the sum of (a) the number of points left uncovered, (b)
the number of points wrongly covered, and (c) the number of covering balls
(multiplied by $\lambda$). The resulting method has a single tuning parameter, $\epsilon$
(the ball radius), which can be estimated by cross validation.
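The three terms of the objective can be evaluated directly for any candidate choice of prototype sets. The sketch below does this under the slack interpretation given above; the function name `pvm_objective`, the dictionary representation of the prototype sets, and the one-dimensional data are assumptions for illustration.

```python
def pvm_objective(D, y, prototypes, eps, lam):
    """Evaluate the PVM objective for a given choice of prototype sets.

    D[i][j]    : dissimilarity between training point x_i and candidate z_j
    y[i]       : class label of x_i
    prototypes : dict mapping class label l to a set of column indices j
    Returns (sum of xi, sum of eta, number of prototypes, total objective).
    """
    xi_total, eta_total = 0, 0
    for i, label in enumerate(y):
        own = sum(1 for j in prototypes.get(label, ()) if D[i][j] < eps)
        other = sum(1 for l, P in prototypes.items() if l != label
                    for j in P if D[i][j] < eps)
        xi_total += 0 if own >= 1 else 1   # xi_i = 1 if x_i is left uncovered
        eta_total += other                 # eta_i = wrong-class balls covering x_i
    n_protos = sum(len(P) for P in prototypes.values())
    return xi_total, eta_total, n_protos, xi_total + eta_total + lam * n_protos

# Hypothetical one-dimensional example with Z = X.
points = [0.0, 0.5, 1.0, 3.0, 3.5]
y = [1, 1, 2, 2, 2]
D = [[abs(x - z) for z in points] for x in points]
xi, eta, k, obj = pvm_objective(D, y, {1: {1}, 2: {3}}, eps=0.75, lam=1 / len(points))
```

In this toy case the class-2 point at 1.0 is both left uncovered by its own class and covered by the class-1 prototype at 0.5, so it contributes to both slack totals.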

We show in the Appendix that the PVM integer program is equivalent to
$L$ separate prize-collecting set cover problems. Let $X_l = \{x_i \in X : y_i = l\}$.
Then, for each class $l$, the set $\mathcal{P}_l \subseteq Z$ is given by the solution to

$$\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} C_l(j)\, \alpha_j^{(l)} + \sum_{x_i \in X_l} \xi_i \\
\text{subject to} \quad & \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{(l)} \ge 1 - \xi_i \quad \forall\, x_i \in X_l \\
& \alpha_j^{(l)} \in \{0, 1\} \quad \forall\, z_j \in Z \\
& \xi_i \ge 0 \quad \forall\, x_i \in X_l
\end{aligned}$$

where $C_l(j)$ is the cost of adding $z_j$ to $\mathcal{P}_l$ and a unit penalty is charged for
each point $x_i$ of class $l$ left uncovered. The cost of a covering set for the
PVM is the number of miscovered points plus a baseline charge of $\lambda$:

$$C_l(j) = \lambda + |B_\epsilon(z_j) \cap (X \setminus X_l)|.$$



3 Solving the problem: two approaches 



The prize-collecting set cover problem can be transformed to a standard set
cover problem [Könemann et al. 2006], which is itself NP-hard, so we do not
expect to find a polynomial-time algorithm to solve the general PVM
problem exactly. Further, certain inapproximability results have been proven
for the set cover problem [Feige 1998].^1 In what follows, we present two
algorithms for approximately solving our problem.



3.1 LP relaxation with randomized rounding 

A well-known approach for the set cover problem is to relax the integer
constraint $\alpha_j^{(l)} \in \{0, 1\}$ by replacing it with $0 \le \alpha_j^{(l)} \le 1$. The result
is a linear program (LP), which is convex and easily solved with any LP
solver. The LP solution is subsequently rounded to recover a feasible (though not
necessarily optimal) solution to the original integer program.

Let $\{\alpha_j^{*(l)}\}$ denote a solution to the LP. Since our solution in general will
be fractional, we adopt the following rounding strategy to produce an
integral solution: For each $j \in \{1, \ldots, m\}$ and $l \in \{1, \ldots, L\}$, we independently
draw $A_j^{(l)} \sim \mathrm{Bernoulli}(\alpha_j^{*(l)})$. Notice that $\alpha_j^{*(l)} \in [0, 1]$, so this approach is
well-defined. Let $S_i$ and $T_i$ denote the slack (corresponding to $\xi_i$ and $\eta_i$)
incurred by the rounded solution $\{A_j^{(l)}\}$. These random variables are given
by

$$S_i = \begin{cases} 1 & \text{if } x_i \text{ uncovered} \iff \sum_{j : x_i \in B_\epsilon(z_j)} A_j^{(y_i)} = 0 \\ 0 & \text{otherwise,} \end{cases}
\qquad
T_i = \sum_{l \ne y_i} \; \sum_{j : x_i \in B_\epsilon(z_j)} A_j^{(l)}. \qquad (4)$$

^1 We do not assume in general that the dissimilarities satisfy the triangle inequality, so
we consider arbitrary covering sets.
The randomized rounding algorithm is as follows (with $B$ typically in the
hundreds):

For $b = 1, \ldots, B$:

1. Draw independently $A_j^{(l)}(b) \sim \mathrm{Bernoulli}(\alpha_j^{*(l)})$.

2. Find the corresponding $S_i(b)$, $T_i(b)$ making this a feasible solution
(using Equation (4)).

3. Evaluate objective $\mathrm{OBJ}(b) = \sum_{i=1}^{n} (S_i(b) + T_i(b)) + \lambda \sum_{j,l} A_j^{(l)}(b)$.

Return $\{A_j^{(l)}(b)\}$ with minimum $\mathrm{OBJ}(b)$.

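In the Appendix, we prove that the expected objective on any iteration
satisfies

$$E\left[\sum_{i=1}^{n} \big(S_i(b) + T_i(b)\big) + \lambda \sum_{j,l} A_j^{(l)}(b)\right] \le \frac{n}{e} + \mathrm{OPT}_{LP} \le \frac{n}{e} + \mathrm{OPT}_{IP}$$

where $\mathrm{OPT}_{LP} = \sum_{i=1}^{n} (\xi_i^* + \eta_i^*) + \lambda \sum_{j=1}^{m} \sum_{l=1}^{L} \alpha_j^{*(l)}$ is the optimal value of
the LP (which is a lower bound on the integer program's optimal value, $\mathrm{OPT}_{IP}$).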

One disadvantage of this approach is that it requires solving an LP, 
which can be relatively slow and memory-intensive for large data sets. The 
approach we describe next is much lighter-weight and is thus our preferred 
method. 
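The rounding loop of Section 3.1 can be sketched as follows. The LP itself is not solved here: the fractional solution `alpha_star` is a hypothetical input that one would obtain from any LP solver, and the function names, the dictionary keyed by (j, l), and the fixed seed are assumptions made for this illustration.

```python
import random

def rounded_objective(A, cover, y, lam):
    """Objective of an integral solution A, with slacks S_i and T_i computed from A.

    A[(j, l)] is 0 or 1; cover[j] is the set of training indices i covered by z_j.
    """
    total = lam * sum(A.values())
    for i, label in enumerate(y):
        own = sum(a for (j, l), a in A.items() if l == label and i in cover[j])
        other = sum(a for (j, l), a in A.items() if l != label and i in cover[j])
        total += (1 if own == 0 else 0) + other   # S_i + T_i
    return total

def randomized_rounding(alpha_star, cover, y, lam, B=200, seed=0):
    """Draw B Bernoulli roundings of alpha_star and keep the one with smallest objective."""
    rng = random.Random(seed)
    best_A, best_obj = None, float("inf")
    for _ in range(B):
        A = {key: 1 if rng.random() < p else 0 for key, p in alpha_star.items()}
        obj = rounded_objective(A, cover, y, lam)
        if obj < best_obj:
            best_A, best_obj = A, obj
    return best_A, best_obj

# Hypothetical fractional solution: 3 training points, 2 candidates, 2 classes.
y = [1, 1, 2]
cover = [{0, 1}, {2}]   # z_0 covers x_0 and x_1; z_1 covers x_2
alpha_star = {(0, 1): 0.9, (0, 2): 0.0, (1, 1): 0.1, (1, 2): 0.8}
best_A, best_obj = randomized_rounding(alpha_star, cover, y, lam=1 / 3)
```

Keeping the best of B draws, rather than a single draw, is what makes the rounded solution's objective no worse than that of a typical draw.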



3.2 A greedy approach 



Another well-known approximation algorithm for the set cover problem is
the greedy algorithm [Vazirani 2001]. At each step, we add the prototype
that has the least ratio of cost to number of points newly covered. However,
here we present a less standard greedy algorithm which has certain practical
advantages over the standard greedy approach and does not, in our
experience, do noticeably worse in minimizing the PVM objective. At each step
we find the $z_j \in Z$ and class $l$ for which adding $z_j$ to $\mathcal{P}_l$ most decreases the
objective function. That is, we find the $(z_j, l)$ pair with the best tradeoff
of covering previously uncovered training points of class $l$ while avoiding
covering points of other classes. The incremental improvement of going
from $(\mathcal{P}_1, \ldots, \mathcal{P}_L)$ to $(\mathcal{P}_1, \ldots, \mathcal{P}_{l-1}, \mathcal{P}_l \cup \{z_j\}, \mathcal{P}_{l+1}, \ldots, \mathcal{P}_L)$ can be denoted
by $\Delta\mathrm{Obj}(z_j, l) = \Delta\xi(z_j, l) - \Delta\eta(z_j, l) - \lambda$ where

$$\Delta\xi(z_j, l) = \Big| X_l \cap \Big( B_\epsilon(z_j) \setminus \bigcup_{z_{j'} \in \mathcal{P}_l} B_\epsilon(z_{j'}) \Big) \Big|$$

$$\Delta\eta(z_j, l) = |B_\epsilon(z_j) \cap (X \setminus X_l)|.$$

The greedy algorithm is simply as follows:

1. Start with $\mathcal{P}_l = \emptyset$ for each class $l$.

2. While $\Delta\mathrm{Obj}(z^*, l^*) > 0$:

• Find $(z^*, l^*) = \arg\max_{(z_j, l)} \Delta\mathrm{Obj}(z_j, l)$.
• Let $\mathcal{P}_{l^*} := \mathcal{P}_{l^*} \cup \{z^*\}$.
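The greedy steps above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `greedy_pvm`, the index-based representation of prototypes, and the tie-breaking rule (first pair encountered wins) are assumptions.

```python
def greedy_pvm(D, y, eps, lam):
    """Greedy selection of prototypes; returns dict label -> list of chosen columns j."""
    n, m = len(D), len(D[0])
    labels = sorted(set(y))
    cover = [{i for i in range(n) if D[i][j] < eps} for j in range(m)]
    members = {l: {i for i in range(n) if y[i] == l} for l in labels}
    covered = {l: set() for l in labels}   # class-l points already covered by class-l balls
    chosen = {l: [] for l in labels}
    while True:
        best, best_gain = None, 0.0
        for l in labels:
            for j in range(m):
                if j in chosen[l]:
                    continue
                d_xi = len((cover[j] & members[l]) - covered[l])   # newly covered, right class
                d_eta = len(cover[j] - members[l])                 # covered, wrong class
                gain = d_xi - d_eta - lam
                if gain > best_gain:
                    best, best_gain = (j, l), gain
        if best is None:   # no pair improves the objective any further
            return chosen
        j, l = best
        chosen[l].append(j)
        covered[l] |= cover[j] & members[l]

# Hypothetical one-dimensional example with Z = X.
points = [0.0, 0.5, 1.0, 3.0, 3.5, 4.0]
y = [1, 1, 1, 2, 2, 2]
D = [[abs(x - z) for z in points] for x in points]
prototypes = greedy_pvm(D, y, eps=0.75, lam=1 / len(points))
```

Each pass scans all (candidate, class) pairs, so this sketch favors clarity over speed; the miscovering count of a pair never changes between passes and could be cached.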

4 Problem-specific considerations 

The PVM provides a considerable amount of flexibility that allows the user 
to tailor it to the particular problem at hand. 

4.1 Dissimilarities 

The PVM depends on all of the $x_i$ and $z_j$ only through the pairwise
dissimilarities $d(x_i, z_j)$ and can accept any matrix with non-negative entries. This
allows it to share in the benefits of kernel methods by using a kernel-based
distance.^2 Also, for problems in the $p \gg n$ realm, using distances that
effectively lower the dimension can lead to improvements. For instance, we have
achieved gains in classification accuracy in some $p \gg n$ simulations by
using the DANN-distance [Hastie and Tibshirani 1996], which is a supervised
measure of distance. Additionally, in certain problems (e.g., in proteomics,
see Section 6.3) the data may not be readily embedded in a vector space.
In such a case, we may still apply the PVM if pairwise dissimilarities are
available.

^2 Given a kernel $K(x, x')$, we can use the distance
$d(x, x') = \sqrt{K(x, x) + K(x', x') - 2K(x, x')}$.
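A kernel matrix can be converted into a dissimilarity matrix with the kernel-based distance $d(x, x') = \sqrt{K(x, x) + K(x', x') - 2K(x, x')}$. The sketch below assumes a symmetric positive semidefinite kernel matrix stored as nested lists; the function name and the clipping of small negative values are choices made for this example.

```python
import math

def kernel_distance_matrix(K):
    """Turn a symmetric positive semidefinite kernel matrix K into distances.

    D[i][j] = sqrt(K[i][i] + K[j][j] - 2 K[i][j]); tiny negative values caused by
    rounding are clipped to zero before taking the square root.
    """
    n = len(K)
    return [[math.sqrt(max(K[i][i] + K[j][j] - 2.0 * K[i][j], 0.0)) for j in range(n)]
            for i in range(n)]

# Hypothetical check with the linear kernel K(x, x') = <x, x'>, for which the
# kernel-based distance coincides with the Euclidean distance.
xs = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
K = [[sum(a * b for a, b in zip(u, v)) for v in xs] for u in xs]
D = kernel_distance_matrix(K)
```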

Finally, given any dissimilarity $d$, we may instead use $\tilde{d}$, defined by
$\tilde{d}(x, z) = |\{x_i \in X : d(x_i, z) \le d(x, z)\}|$. Using $\tilde{d}$ induces $\epsilon$-balls $B_\epsilon(z_j)$
containing the $(\lfloor \epsilon \rfloor - 1)$ nearest training points to $z_j$.
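The rank-based dissimilarity can be computed column by column from any dissimilarity matrix. The sketch below reads the defining inequality as non-strict (the count includes ties and the point itself), which is an assumption about the garbled source formula; the function names and the example are hypothetical.

```python
def rank_dissimilarity(d_xz, d_train_to_z):
    """Number of training points x_i whose distance to z is at most d(x, z)."""
    return sum(1 for d_i in d_train_to_z if d_i <= d_xz)

def rank_matrix(D):
    """Apply the rank-based dissimilarity to every entry of D, where D[i][j] = d(x_i, z_j)."""
    n, m = len(D), len(D[0])
    columns = [[D[i][j] for i in range(n)] for j in range(m)]
    return [[rank_dissimilarity(D[i][j], columns[j]) for j in range(m)] for i in range(n)]

# Hypothetical example: four training points on a line, with Z = X.
points = [0.0, 1.0, 3.0, 7.0]
D = [[abs(x - z) for z in points] for x in points]
R = rank_matrix(D)
# With eps = 3, the ball around the first candidate holds the points of rank 1 and 2.
ball_of_first = [i for i in range(len(points)) if R[i][0] < 3]
```

With an integer radius of 3 the ball contains two training points, matching the count of $\lfloor \epsilon \rfloor - 1$ stated above.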



4.2 Prototypes not on training points 

Another inherent flexibility of the PVM is in the choice of Z, the set of 
potential prototypes. While Z = X is a standard choice, we have experi- 
mented with other possibilities as well. For example, if we are also given a 
set of unlabeled data (e.g., a test set), we may add these examples as po- 
tential prototypes, yielding a semi-supervised version of the PVM. Doing so 
preserves the property that all prototypes are actual examples (rather than 
arbitrary points in $\mathbb{R}^p$).

We believe that having prototypes confined to lie on actual observed 
points is desirable for interpretability. However, in circumstances in which 
this property is not needed, Z may be further augmented to include other 
points. For example, one could run $K$-means on each class's points individually
(or on the training set as a whole) and add these $L \cdot K$ centroids
to Z. This method seems to help especially in high dimensional problems 
where constraining all prototypes to lie on data points suffers from the curse 
of dimensionality. Another successful choice for Z is to sample uniformly 
within the convex hull of each class's training points. 



5 Related Work 



Before presenting the PVM's empirical performance on data sets, we
discuss its relation to several pre-existing methods. The PVM with $Z = X$
selects a subset of the original training set as prototypes. In this sense,
it is similar in spirit to condensing and data editing methods, such as the
condensed nearest neighbor rule [Hart 1968] and multiedit [Devijver and
Kittler 1982]. Hart [1968] introduces the notion of the minimal consistent
subset: the smallest subset of $X$ for which nearest-prototype classification
has 0 training error. The PVM objective, $\sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \eta_i + \lambda \sum_{j,l} \alpha_j^{(l)}$,
represents a sort of compromise, governed by $\lambda$, between consistency (first
two terms) and minimality (third term). In future work, we will investigate
formulations similar to PVM more closely directed toward the goal of the
minimal consistent subset.

In a similar vein, an interesting connection can be drawn to the recent
work of Weinberger and Saul [2009] in which they introduce large margin
nearest neighbor classification (LMNN), a novel approach to learning a
metric that is well-suited to $k$-NN. LMNN seeks a linear transformation of
the feature space that brings same-class nearest neighbors closer together
and makes opposing-class points farther apart with the goal of having each
training point's $k$-nearest neighborhood as homogeneous (in class label) as
possible. The motivating intuition is thus similar to that of the PVM, in
particular properties (a) and (b) of Section 2. The obvious difference
between the methods is in what they output: LMNN learns a metric whereas
PVM selects prototypes.

Finally, we mention a few other nearest prototype methods. $K$-means
and $K$-medoids are common unsupervised methods which produce
prototypes. Simply running these methods on each class separately yields
prototype sets $\mathcal{P}_1, \ldots, \mathcal{P}_L$. $K$-medoids is similar to PVM in that its prototypes
are selected from a finite set. In contrast, $K$-means's prototypes are not
required to lie on training points, making the method adaptive. Probably the
most widely used prototype method is learning vector quantization (LVQ)
[Kohonen 2001]. It is an adaptive prototype method as well. Several versions
of LVQ exist, varying in certain details, but each begins with an initial set
of prototypes and then iteratively adjusts them in a fashion that tends to
encourage each prototype to lie near many training points of its class and
away from training points of other classes.

6 Examples on simulated and real data 

We compare the PVM's performance to some of the prototype methods
mentioned above. For $K$-medoids, we run pam of the R package cluster on each
class's data separately, producing $K$ prototypes per class.

For LVQ, we use the functions lvqinit and olvq1 (optimized learning
vector quantization 1 [Kohonen 2001]) from the R package class. We vary
the initial codebook size to produce a range of solutions.

Figure 2: Mixture of Gaussians training data. Classification boundaries of
(left to right) Bayes, PVM-Greedy, K-medoids, LVQ (with Bayes boundary
shown in gray for comparison).

6.1 Mixture of Gaussians simulation 

For demonstration purposes, we consider a three-class example with $p = 2$.
Each class was generated as a mixture of 10 Gaussians (details given in the
Appendix). Figure 1 shows the PVM solution for a range of values of the
tuning parameter $\epsilon$. In Figure 2 we display the classification boundaries for
the PVM, $K$-medoids, and LVQ (taking the lowest test error solution for
each method). Since we generated this example from a known model, we
are able to compute the Bayes boundary. We see that the PVM succeeds in
capturing the shape of the boundary. The erratic boundary of $K$-medoids
highlights an advantage of the PVM over $K$-medoids: the latter does not
consider the relation between classes when choosing prototypes and therefore
does not perform well when classes overlap.

6.2 ZIP code digits data 

We apply the PVM to the USPS handwritten digits data set which consists
of a training set of $n = 7291$ grayscale ($16 \times 16$ pixel) images of handwritten
digits 0-9 (and 2007 test images). We run the PVM for a range of values
of $\epsilon$ from the minimum interpoint distance (in which the PVM retains the
entire training set and so reduces to 1-NN classification) to approximately
the 14th percentile of interpoint distances.

The lefthand panel of Figure 3 shows the test error as a function of the
number of prototypes for several methods using the Euclidean metric. Since
both LVQ and $K$-means can place prototypes anywhere in the feature space,
which is advantageous in high-dimensional problems, we also allow PVM to
select prototypes that do not lie on the training points by augmenting $Z$.
In this case, we run 10-means clustering on each class separately and then
add these resulting 100 points to $Z$ (in addition to $X$).

Figure 3: Digits data set. (Left) All methods use Euclidean distance. (Right)
Both use tangent distance. The rightmost point on each PVM curve
corresponds to 1-NN classification.

The notion of the tangent distance between two such images was
introduced by Simard et al. [1993] to account for certain invariances in this
problem (e.g., the thickness and orientation of a digit are not relevant
factors when we consider how similar two digits are). Use of tangent distance
with 1-NN attained the lowest test errors of any method [Hastie and Simard
1998]. Since the PVM operates on an arbitrary dissimilarities matrix, we can
easily use the tangent distance in place of the standard Euclidean metric.
The righthand panel of Figure 3 shows the test errors when tangent distance
is used. $K$-medoids similarly readily accommodates any dissimilarity. While
LVQ has been generalized to arbitrary differentiable metrics, there does not
appear to be generic, off-the-shelf software available. The lowest test error
attained by the PVM is 2.49% with a 3372-prototype solution (compared to
1-NN's 3.09%).^3 Also, we can see that for a wide range of $\epsilon$ values we get a
solution with test error comparable to that of 1-NN, but requiring far fewer
prototypes. An advantageous feature of the PVM is that it automatically
chooses the number of prototypes per class to use. In this example, it is
interesting to see the class-frequencies of prototypes (see Table 1).

The most dramatic feature of this solution is that it only retains seven
of the 1005 examples of the digit 1. This reflects the fact that, relative to
other digits, the digit 1 has the least variation when handwritten. Indeed,
the average (tangent) distance between digit 1's in the training set is less
than half that of any other digit (the second least variable digit is 7).

^3 Hastie and Simard [1998] report a 2.6% test error for 1-NN on this data set. The
difference may be due to implementation details of the tangent distance.

Digit            0     1    2    3    4    5    6    7    8    9   Total
Training set  1194  1005  731  658  652  556  664  645  542  644    7291
PVM-best       493     7  661  551  324  486  217  101  378  154    3372

Table 1: Number of prototypes chosen per class

Figure 4: First 56 (of 3372) PVM-Greedy prototypes. Above each is the
number of training images first correctly covered by the addition of this
prototype (in parentheses is the number of miscovered training points by this
prototype).

In this example, we took $Z = X$, so that each prototype is an actual
handwritten digit from the training set (rather than being some linear
combination of many handwritten digits). Figures 4 and 5 show images of the
first 88 prototypes (of 3372) selected by the greedy algorithm. Above each
image of Figure 4 is the number of training images previously uncovered that
were correctly covered by the addition of this prototype and, in parentheses,
the number of training points that are miscovered by this prototype. For
example, we can see that the first prototype selected by the greedy
algorithm, which was a "1," covered 986 training images of 1's and four training
images that were not of 1's. These four training images are shown in Figure
6. Indeed, all of them look very much like 1's, which explains the algorithm's
confusion.

The lefthand panel of Figure 7 shows the improvement in the PVM
objective, $\Delta\xi - \Delta\eta$, after each step of the greedy algorithm, revealing an
interesting feature of the solution: we find that after the first 458 prototypes
are added, each remaining prototype covers only one training point. Since
in this example we took $Z = X$ (and since a point always covers itself), this
means that the final 2914 prototypes were chosen to cover only themselves.
In this sense, we see that the PVM provides a sort of compromise between a
sparse nearest prototype classifier and 1-NN. The compromise is determined
by the prototype-cost parameter $\lambda$. If $\lambda > 1$, the algorithm does not enter
the 1-NN regime.

Figure 5: The first 88 prototypes (out of 3372) of the PVM-Greedy
solution. We perform MDS (sammon, stress=0.07) on the tangent distances to
visualize the prototypes in two dimensions.

Figure 6: The four training images that were miscovered by the first prototype
of class 1 (see Figure 4). Their correct labels are 2, 4, 4, and 8.

Figure 7: Progress of greedy as a function of number of prototypes added.
(Left panel: Progress of Greedy Algorithm. Right panel: Error on Test Set.)

The righthand panel of Figure 7 shows the improvement in test error
gained by running the greedy algorithm beyond the first 88 steps
(corresponding to the $\lambda = 6$ solution). It is interesting to look at elements of
the test set that are misclassified when we use just the 88 prototypes but
are correctly classified when using the complete PVM-greedy solution (with
all 3372 prototypes). There are 276 (out of 2007) such elements. Figure 8
shows seven randomly chosen examples of these test points.



6.3 Protein Classification with String Kernels 

In our next example, we present a case in which the patterns are not
naturally represented as vectors in $\mathbb{R}^p$. Leslie et al. [2004] study the problem of
classification of proteins based on their amino acid sequences. They
introduce a measure of similarity between protein sequences called the mismatch
kernel. The general idea is that two sequences should be considered similar
if they have a large number of short sequences in common (where two short
sequences are considered the same if they have no more than a specified
number of mismatches). We take as input a $1708 \times 1708$ matrix with $K_{ij}$
containing the value of the normalized mismatch kernel evaluated between
proteins $i$ and $j$ (the data and software are from Leslie et al. [2004]). The
proteins fall into two classes, "Positive" and "Negative," according to whether
they belong to a certain protein family. We compute pairwise distances from
this kernel via $D_{ij} = \sqrt{K_{ii} + K_{jj} - 2K_{ij}}$ and then run the PVM and
$K$-medoids. Figure 9 shows the 10-fold cross-validated errors for the PVM and
$K$-medoids. For the PVM, we take a range of equally-spaced quantiles of
the pairwise distances from the minimum to the median for the parameter
$\epsilon$. For $K$-medoids, we take as parameter the fraction of proteins in each
class that should be prototypes. This choice of parameter allows the classes
to have different numbers of prototypes, which is important in this example
because the classes are greatly imbalanced (only 45 of the 1708 proteins are
in class "Positive"). The minimum CV-error (1.76%) is attained by PVM
using about 870 prototypes (averaged over the 10 models fit for that value
of $\epsilon$). This error is identical to the minimum CV-error of a support
vector machine (tuning the cost parameter) trained using this kernel. Fitting a
model to the whole data set with the selected value of $\epsilon$, the PVM chooses 26
prototypes (of 45) for class "Positive" and 907 (of 1663) for class "Negative."

Figure 8: Each row corresponds to a test digit that is misclassified using just
the first 88 prototypes (which are shown in Figure 5). From left to right: the
test digit itself, the nearest prototype among the 88 prototypes, the nearest
prototype of the correct class (among the 88), and the nearest prototype in
the full 3372-prototype solution.

Figure 9: Proteins data set (cross-validation error). Recall that the rightmost
point on the PVM curve corresponds to 1-NN classification.



6.4 UCI data sets 

Finally, we run the PVM on six data sets from the UCI Machine Learning
Repository [Asuncion and Newman 2007] and compare its performance to
that of 1-NN (i.e., retaining all training points as prototypes), $K$-medoids,
and LVQ. We randomly select 2/3 of each data set for training and use the
remainder as a test set. Ten-fold cross-validation (and the "1 standard error
rule" [Hastie et al. 2009]) is performed on the training data to select a value
for each method's tuning parameter (except for 1-NN). Table 2 reports the
error on the test set and the number of prototypes selected for each method.
We see that in most cases PVM is able to do as well as or better than 1-NN
but with a substantial reduction in prototypes. No single method does best
on all of the data sets.
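For readers unfamiliar with the one-standard-error rule, the following sketch shows one common way to apply it to per-fold cross-validation errors. It is a generic illustration, not the authors' code: the function name, the ordering of candidates from most complex to simplest, and the toy error values are all assumptions.

```python
import math

def one_se_rule(cv_errors):
    """Pick a tuning-parameter index by the one-standard-error rule.

    cv_errors[k] lists the per-fold errors of the k-th candidate value, with
    candidates ordered from most complex to simplest model (an assumption of
    this sketch). Returns the index of the simplest candidate whose mean error
    is within one standard error of the smallest mean error.
    """
    means = [sum(e) / len(e) for e in cv_errors]
    best = min(range(len(means)), key=means.__getitem__)
    folds = cv_errors[best]
    var = sum((e - means[best]) ** 2 for e in folds) / (len(folds) - 1)
    threshold = means[best] + math.sqrt(var / len(folds))
    return max(k for k in range(len(means)) if means[k] <= threshold)

# Hypothetical per-fold errors for four values of the tuning parameter.
cv_errors = [
    [0.10, 0.12, 0.08, 0.10],   # most complex model
    [0.09, 0.11, 0.07, 0.09],   # smallest mean error
    [0.10, 0.11, 0.08, 0.10],   # slightly worse, but simpler
    [0.20, 0.22, 0.18, 0.21],   # simplest, clearly worse
]
selected = one_se_rule(cv_errors)
```

For a prototype method, "simpler" would typically mean fewer prototypes, so the rule trades a small amount of estimated accuracy for a sparser model.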



7 Discussion 

We have introduced a new prototype method, which can be used both for
classification and for "summarizing" a data set. The PVM is the solution
to a set cover problem which describes our notion of a desirable prototype
set. Applying the PVM to the digits data highlights some of its strengths.
First, it has competitive test error for a wide range of values of the tuning
parameter. Its success in this example stems in part from its flexibility: it
was easily used with a problem-specific measure of dissimilarity.
Additionally, it automatically chooses a suitable number of prototypes for each class.
Particularly useful for interpretation is the fact that each PVM-prototype is
an observation in the training set (i.e., is an actual hand drawn image). In
medical applications, this would mean that prototypes correspond to actual
patients. This feature may be of great practical use to domain experts for
making sense of large data sets.

The PVM software will be made available as an R package in the R
library.

Data                                1-NN    PVM   K-medoids    LVQ
Diabetes          Test Error (%)    28.9   24.2        33.2   25.0
(p = 8, L = 2)    # Prototypes       512     12          44     29
Glass             Test Error (%)    38.0   36.6        39.4   35.2
(p = 9, L = 6)    # Prototypes       143     34          12     17
Heart             Test Error (%)    21.1   21.1        17.8   15.6
(p = 13, L = 2)   # Prototypes       180      6          26     12
Liver             Test Error (%)    41.7   41.7        40.0   33.9
(p = 6, L = 2)    # Prototypes       230     16          20    110
Vowel             Test Error (%)     2.8    2.8         2.8   19.9
(p = 10, L = 11)  # Prototypes       352    352         198    193
Wine              Test Error (%)     3.4   11.9         6.8    3.4
(p = 13, L = 3)   # Prototypes       119      4          12      3

Table 2: Test errors for the UCI data sets. For PVM, K-medoids, and LVQ,
we used 10-fold cross validation (with the 1 SE rule) on the training set to
tune the parameters.

A PVM's relation to prize-collecting set cover

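Claim: Solving the PVM integer program is equivalent to solving $L$
prize-collecting set cover problems.

Proof. Recall that the PVM integer program is given by

$$\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \eta_i + \lambda \sum_{j,l} \alpha_j^{(l)} \\
\text{subject to} \quad & \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{(y_i)} \ge 1 - \xi_i \quad \forall\, x_i \in X \\
& \sum_{l \ne y_i} \; \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{(l)} \le 0 + \eta_i \quad \forall\, x_i \in X \\
& \alpha_j^{(l)} \in \{0, 1\} \quad \forall\, z_j \in Z,\; l \in \{1, \ldots, L\} \\
& \xi_i, \eta_i \ge 0 \quad \forall\, x_i \in X.
\end{aligned}$$

Now, the second set of inequality constraints is always tight, so we can
eliminate the slack variables $\eta_1, \ldots, \eta_n$:

$$\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \sum_{l \ne y_i} \; \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{(l)} + \lambda \sum_{j,l} \alpha_j^{(l)} \\
\text{subject to} \quad & \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{(y_i)} \ge 1 - \xi_i \quad \forall\, x_i \in X \\
& \alpha_j^{(l)} \in \{0, 1\} \quad \forall\, z_j \in Z,\; l \in \{1, \ldots, L\} \\
& \xi_i \ge 0 \quad \forall\, x_i \in X.
\end{aligned}$$

We can rewrite the second term of the objective as

$$\begin{aligned}
\sum_{i=1}^{n} \sum_{l \ne y_i} \; \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{(l)}
&= \sum_{i=1}^{n} \sum_{j,l} 1\{x_i \in B_\epsilon(z_j),\, l \ne y_i\}\, \alpha_j^{(l)} \\
&= \sum_{j,l} \alpha_j^{(l)} \sum_{i=1}^{n} 1\{x_i \in B_\epsilon(z_j),\, x_i \notin X_l\} \\
&= \sum_{j,l} \alpha_j^{(l)}\, |B_\epsilon(z_j) \cap (X \setminus X_l)|.
\end{aligned}$$

So the entire objective becomes

$$\sum_{i=1}^{n} \xi_i + \sum_{j,l} \big( |B_\epsilon(z_j) \cap (X \setminus X_l)| + \lambda \big)\, \alpha_j^{(l)}.$$

Letting $C_l(j) = \lambda + |B_\epsilon(z_j) \cap (X \setminus X_l)|$, the integer program may be written
as

$$\begin{aligned}
\text{minimize} \quad & \sum_{l=1}^{L} \Big[ \sum_{x_i \in X_l} \xi_i + \sum_{j=1}^{m} C_l(j)\, \alpha_j^{(l)} \Big] \\
\text{subject to, } \forall\, l \in \{1, \ldots, L\}, \quad & \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{(l)} \ge 1 - \xi_i \quad \forall\, x_i \in X_l \\
& \alpha_j^{(l)} \in \{0, 1\} \quad \forall\, z_j \in Z \\
& \xi_i \ge 0 \quad \forall\, x_i \in X_l.
\end{aligned}$$

Written in this way, we see that both the objective and the constraints are
separable with respect to class, meaning that the solution to the above is
equivalent to solving $L$ integer programs (one for each $l \in \{1, \ldots, L\}$). The
$l$th integer program has variables $\alpha_1^{(l)}, \ldots, \alpha_m^{(l)}$ and $\{\xi_i : x_i \in X_l\}$ and is given
by

$$\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} C_l(j)\, \alpha_j^{(l)} + \sum_{x_i \in X_l} \xi_i \\
\text{subject to} \quad & \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{(l)} \ge 1 - \xi_i \quad \forall\, x_i \in X_l \\
& \alpha_j^{(l)} \in \{0, 1\} \quad \forall\, z_j \in Z \\
& \xi_i \ge 0 \quad \forall\, x_i \in X_l.
\end{aligned}$$

This is precisely the prize-collecting set cover problem (with unit penalty
for leaving a point uncovered). □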



B Randomized Rounding Bound 



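Claim: Given the randomized rounding procedure described in Section 3.1,
the objective on each iteration satisfies

$$E[\mathrm{OBJ}] \le \frac{n}{e} + \mathrm{OPT}_{IP}.$$

Proof. Let $\{\alpha_j^{*(l)}, \xi_i^*, \eta_i^*\}$ denote a solution to the LP and recall that for each
iteration, we sample independently

$$A_j^{(l)} \sim \mathrm{Bernoulli}(\alpha_j^{*(l)}).$$

The PVM objective on this iteration is given by

$$\mathrm{OBJ} = \sum_{i=1}^{n} (S_i + T_i) + \lambda \sum_{j,l} A_j^{(l)}$$

where

$$S_i = \begin{cases} 1 & \text{if } x_i \text{ uncovered} \iff \sum_{j : x_i \in B_\epsilon(z_j)} A_j^{(y_i)} = 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
T_i = \sum_{l \ne y_i} \; \sum_{j : x_i \in B_\epsilon(z_j)} A_j^{(l)}.$$

Now, by linearity of expectation, we have

$$E[\mathrm{OBJ}] = \sum_{i=1}^{n} \big( P[x_i \text{ uncovered}] + \eta_i^* \big) + \lambda \sum_{j,l} \alpha_j^{*(l)}$$

since $E[T_i] = \sum_{l \ne y_i} \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{*(l)} = \eta_i^*$ (Constraint 3b is tight at the LP optimum).
Now,

$$\begin{aligned}
P(x_i \text{ uncovered}) &= P\big( A_j^{(y_i)} = 0 \;\; \forall\, j : x_i \in B_\epsilon(z_j) \big) \\
&= \prod_{j : x_i \in B_\epsilon(z_j)} \big( 1 - \alpha_j^{*(y_i)} \big) \\
&\le \exp\Big( - \sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{*(y_i)} \Big) \\
&\le e^{\xi_i^* - 1}
\end{aligned}$$

using that $1 - x \le e^{-x}$ and, by LP feasibility (Constraint 3a), that $-\sum_{j : x_i \in B_\epsilon(z_j)} \alpha_j^{*(y_i)} \le -(1 - \xi_i^*)$.
Now, $0 \le \xi_i^* \le 1$ and

$$e^{x - 1} \le \frac{1}{e} + \frac{e - 1}{e}\, x \quad \text{for } 0 \le x \le 1$$

so

$$P(x_i \text{ uncovered}) \le \frac{1}{e} + \frac{e - 1}{e}\, \xi_i^*$$

from which it follows that

$$\begin{aligned}
E[\mathrm{OBJ}] &\le \sum_{i=1}^{n} \Big( \frac{1}{e} + \frac{e - 1}{e}\, \xi_i^* + \eta_i^* \Big) + \lambda \sum_{j,l} \alpha_j^{*(l)} \\
&= \frac{n}{e} + \sum_{i=1}^{n} \Big( \Big(1 - \frac{1}{e}\Big) \xi_i^* + \eta_i^* \Big) + \lambda \sum_{j,l} \alpha_j^{*(l)} \\
&\le \frac{n}{e} + \mathrm{OPT}_{LP} \\
&\le \frac{n}{e} + \mathrm{OPT}_{IP}.
\end{aligned}$$

□

C Mixture of Gaussians example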

We generate the data of Section 6.1 in the style of Hastie et al. [2009], Section
2.3.3. In particular,

• Fix 3 class centers $M_1, M_2, M_3 \in \mathbb{R}^2$, sampled from $N(0, 16 I_2)$.

• For each class $k$, independently generate $m_1^{(k)}, \ldots, m_{10}^{(k)} \sim N(M_k, I_2)$.

• For $i = 1, \ldots, n$, choose $j \in \{1, \ldots, 10\}$ uniformly at random, then
draw $x_i \mid y_i \sim N(m_j^{(y_i)}, I_2/5)$.

We take $n = 300$ in this case, with 100 points in each class.
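The generating steps listed above can be sketched with Python's standard library. The function name, the default arguments, and the seed are assumptions for this example, and the covariances are read as $16 I_2$, $I_2$, and $I_2/5$ (per-coordinate standard deviations of 4, 1, and $\sqrt{1/5}$).

```python
import random

def generate_mixture(n_per_class=100, n_classes=3, n_components=10, seed=0):
    """Simulate three-class mixture-of-Gaussians data in two dimensions."""
    rng = random.Random(seed)
    X, y = [], []
    sd_obs = (1.0 / 5.0) ** 0.5   # covariance I_2 / 5 for the observations
    for k in range(1, n_classes + 1):
        # Class center with covariance 16 I_2: standard deviation 4 per coordinate.
        M = (rng.gauss(0.0, 4.0), rng.gauss(0.0, 4.0))
        # Component means scattered around the class center with covariance I_2.
        means = [(rng.gauss(M[0], 1.0), rng.gauss(M[1], 1.0)) for _ in range(n_components)]
        for _ in range(n_per_class):
            m = rng.choice(means)   # uniform choice among the components
            X.append((rng.gauss(m[0], sd_obs), rng.gauss(m[1], sd_obs)))
            y.append(k)
    return X, y

X, y = generate_mixture()
```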